NumPy Array Basics - Vectorization


In [1]:
import sys
import numpy as np
print(sys.version)  # parenthesized form works on Python 2 and 3


2.7.6 (default, Sep  9 2014, 15:04:36) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)]

In [11]:
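# 20 random integers from 0 to 50 inclusive.
# randint excludes its upper bound, hence 51; random_integers is deprecated.
npa = np.random.randint(0, 51, 20)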

I’ve harped on about vectorization in the last couple of videos and told you that it’s great, but I haven’t shown you why it’s so great.

Here are the two powerful reasons:

  • Concise
  • Efficient

The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations.

You can read more here: https://en.wikipedia.org/wiki/Array_programming
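
As a rough illustration of the idea above, here is a minimal sketch that doubles a few numbers twice: once with an explicit loop over scalars and once with a single whole-array expression. The array `values` and the names `doubled_loop` and `doubled_array` are made up for this example.

```python
import numpy as np

# A small fixed array so the result is reproducible (illustrative values).
values = np.array([11, 29, 11, 17, 39])

# Scalar style: visit each element in an explicit Python loop.
doubled_loop = []
for v in values:
    doubled_loop.append(int(v) * 2)

# Array style: one expression applies to every element at once.
doubled_array = values * 2

print(doubled_loop)            # [22, 58, 22, 34, 78]
print(doubled_array.tolist())  # [22, 58, 22, 34, 78]
```

Both versions give the same numbers; the array version simply says what to do with the whole aggregate instead of spelling out the loop.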


In [12]:
npa


Out[12]:
array([11, 29, 11, 17, 39, 24, 49,  1, 10, 48, 44, 47, 45, 48, 41,  6, 49,
       24,  0,  5])

With vectorization we can apply changes to the entire array extremely efficiently, with no more for loops. If we want to double the array, we just multiply it by 2; if we want to cube it, we just raise it to the power of 3.


In [13]:
npa * 2


Out[13]:
array([22, 58, 22, 34, 78, 48, 98,  2, 20, 96, 88, 94, 90, 96, 82, 12, 98,
       48,  0, 10])

In [14]:
npa ** 3


Out[14]:
array([  1331,  24389,   1331,   4913,  59319,  13824, 117649,      1,
         1000, 110592,  85184, 103823,  91125, 110592,  68921,    216,
       117649,  13824,      0,    125])

In [15]:
[x * 2 for x in npa]


Out[15]:
[22, 58, 22, 34, 78, 48, 98, 2, 20, 96, 88, 94, 90, 96, 82, 12, 98, 48, 0, 10]

So who cares? Again, it’s going to be an efficiency thing, just like boolean selection. Let’s try something a bit more complex.

Define a function named new_func that cubes the value if it is less than 10 and squares it if it is greater than or equal to 10.


In [22]:
def new_func(numb):
    if numb < 10:
        return numb**3
    else:
        return numb**2

In [23]:
new_func(npa)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-3e04545c215c> in <module>()
----> 1 new_func(npa)

<ipython-input-22-a509179a0915> in new_func(numb)
      1 def new_func(numb):
----> 2     if numb < 10:
      3         return numb**3
      4     else:
      5         return numb**2

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

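However, we can’t just pass in the whole vector. The comparison numb < 10 produces an array of booleans, one per element, while the if statement needs a single True or False, so NumPy raises this ambiguity error.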
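
To see where the error comes from, the sketch below reproduces it on a small hand-picked array. The names `values`, `mask`, and `raised` are illustrative and not part of the notebook.

```python
import numpy as np

values = np.array([11, 1, 10, 6, 5])

# Comparing an array with a scalar yields one boolean per element.
mask = values < 10
print(mask.tolist())  # [False, True, False, True, True]

# `if` needs a single True/False; with several elements NumPy refuses to guess.
try:
    bool(mask)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# any() and all() each reduce the mask to one unambiguous boolean.
print(mask.any())  # True
print(mask.all())  # False
```

Neither any() nor all() is what new_func needs, though: the function should make a separate decision for each element, which is why a different approach is required.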


In [24]:
?np.vectorize

We need to vectorize this operation, and we do that with np.vectorize.

We can then apply that to our entire array and it takes care of the complexity for us. We can think in terms of the data without having to think about each individual element.


In [25]:
vect_new_func = np.vectorize(new_func)

In [26]:
type(vect_new_func)


Out[26]:
numpy.lib.function_base.vectorize

In [27]:
vect_new_func(npa)


Out[27]:
array([ 121,  841,  121,  289, 1521,  576, 2401,    1,  100, 2304, 1936,
       2209, 2025, 2304, 1681,  216, 2401,  576,    0,  125])

In [28]:
[new_func(x) for x in npa]


Out[28]:
[121,
 841,
 121,
 289,
 1521,
 576,
 2401,
 1,
 100,
 2304,
 1936,
 2209,
 2025,
 2304,
 1681,
 216,
 2401,
 576,
 0,
 125]

It's also faster to vectorize operations than to loop over the elements in Python. These are simple examples, but the benefits will become apparent as we continue through this course.


In [29]:
%timeit [new_func(x) for x in npa]
%timeit vect_new_func(npa)


10000 loops, best of 3: 54 µs per loop
10000 loops, best of 3: 37.7 µs per loop

In [30]:
# 20,000 random integers from 0 to 100 inclusive (randint excludes the upper bound).
npa2 = np.random.randint(0, 101, 20 * 1000)

Now let’s see how the speed comparison changes with array size, using 20,000 elements.


In [31]:
%timeit [new_func(x) for x in npa2]
%timeit vect_new_func(npa2)


10 loops, best of 3: 57.4 ms per loop
100 loops, best of 3: 8.47 ms per loop
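
One caveat worth keeping in mind: the NumPy documentation describes np.vectorize as provided primarily for convenience, not for performance, and notes that its implementation is essentially a for loop. For a simple two-way rule like this one, a possible alternative (not used in the notebook above) is np.where, which works on whole arrays. The sketch below only checks that both approaches agree on a small hand-picked array; the names `values`, `via_vectorize`, and `via_where` are illustrative, and any speed difference would need to be measured with %timeit on your own machine.

```python
import numpy as np

def new_func(numb):
    # Same rule as in the notebook: cube below 10, square otherwise.
    if numb < 10:
        return numb ** 3
    else:
        return numb ** 2

values = np.array([11, 29, 1, 10, 6, 0, 5])

# np.vectorize wraps the scalar function and calls it once per element.
via_vectorize = np.vectorize(new_func)(values)

# np.where computes both branches on the whole array, then picks
# element by element according to the condition.
via_where = np.where(values < 10, values ** 3, values ** 2)

print(via_where.tolist())                        # [121, 841, 1, 100, 216, 0, 125]
print(np.array_equal(via_vectorize, via_where))  # True
```

Note that np.where evaluates both values ** 3 and values ** 2 for every element, which is fine here but may matter when one branch is expensive or invalid for some inputs.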